
    Timely Long Tail Identification through Agent Based Monitoring and Analytics

    The increasing complexity and scale of distributed systems has resulted in emergent behavior that substantially affects overall system performance. A significant emergent property is the "Long Tail", whereby a small proportion of straggling tasks significantly impacts job completion times. To mitigate such behavior, straggling tasks must be identified accurately and in a timely manner; however, current approaches focus on mitigation rather than identification, and typically identify stragglers too late in the execution lifecycle. This paper presents a method and tool for identifying Long Tail behavior within distributed systems in a timely manner, through a combination of online and offline analytics: historical analysis profiles and models task execution patterns, which then inform online analytic agents that monitor task execution at runtime. Furthermore, we provide an empirical analysis of two large-scale production Cloud datacenters that demonstrates the challenge of data skew within modern distributed systems; this analysis shows that approximately 5% of task stragglers caused by data skew impact 50% of the total jobs for batch processes. Our results demonstrate that our approach identifies task stragglers less than 11% into their execution lifecycle with 98% accuracy, a significant improvement over current state-of-the-art practice that enables far more effective mitigation strategies in large-scale distributed systems.
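    The offline-profile / online-monitor split described above lends itself to a compact illustration. The Python sketch below shows one plausible shape of the idea: historical durations yield a per-job profile, and a runtime agent flags any task whose projected duration drifts beyond it. The function names, the linear extrapolation, and the two-standard-deviation cut-off are all illustrative assumptions, not the paper's actual method.

```python
# Minimal sketch of offline profiling feeding an online straggler check.
# Model form and thresholds are illustrative assumptions only.
from statistics import mean, stdev

def profile_tasks(historical_durations):
    """Offline step: summarize past task durations for a job type."""
    return {"mean": mean(historical_durations),
            "stdev": stdev(historical_durations)}

def is_straggler(elapsed, progress, profile, k=2.0):
    """Online step: flag a task whose projected duration exceeds
    the historical mean by k standard deviations."""
    if progress <= 0:
        return False
    projected = elapsed / progress  # naive linear extrapolation
    return projected > profile["mean"] + k * profile["stdev"]

profile = profile_tasks([100, 110, 95, 105, 102, 98])
# A task only 10% complete after 30s projects to 300s -> flagged early.
print(is_straggler(elapsed=30, progress=0.10, profile=profile))
```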

    Dependability in Federated Cloud Environments

    Cloud Computing has emerged as a large-scale distributed system model for utility computing, whereby services are supplied on demand. It has been proposed that Clouds are evolving from single, monolithic Clouds, such as EC2 or Microsoft Azure, serving many consumers, towards a federation of autonomous Clouds. However, a number of research challenges remain in building dependable and robust Clouds; this is a critical research problem that has yet to be fully understood. This paper discusses the issues and challenges surrounding Cloud dependability, and outlines research areas of opportunity for improving the dependability and robustness of federated Clouds.

    Virtual Machine Level Temperature Profiling and Prediction in Cloud Datacenters

    Temperature prediction can enhance datacenter thermal management and minimize cooling power draw. Traditional approaches analyze task-temperature profiles or resistor-capacitor circuit models to predict CPU temperature. However, they are unable to capture task resource heterogeneity within multi-tenant environments, or to make predictions under dynamic scenarios such as virtual machine migration, one of the main characteristics of Cloud computing. This paper proposes virtual machine level temperature prediction in Cloud datacenters. Experiments show that the mean squared error of stable CPU temperature prediction is within 1.10, and that dynamic CPU temperature prediction achieves a mean squared error within 1.60 in most scenarios.
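    To make the notion of VM-level prediction concrete, here is a deliberately simple sketch: a least-squares line relating per-VM CPU utilization to CPU temperature, evaluated with the same mean-squared-error metric the abstract reports. The single feature and the training values are hypothetical; the paper's actual predictor is not specified here.

```python
# Illustrative only: a least-squares linear model from per-VM CPU
# utilization to CPU temperature, i.e. modeling at the VM level rather
# than the host level. Data and feature choice are assumptions.
def fit_linear(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Hypothetical training data: (VM CPU utilization %, CPU temperature C).
util = [10, 25, 40, 55, 70, 85]
temp = [38, 42, 47, 52, 58, 63]
a, b = fit_linear(util, temp)
predicted = a * 60 + b  # predict temperature at 60% utilization
mse = sum((a * u + b - t) ** 2 for u, t in zip(util, temp)) / len(util)
print(f"predicted={predicted:.1f}C mse={mse:.2f}")
```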

    Reliable Computing Service in Massive-scale Systems Through Rapid Low-cost Failover

    Large-scale distributed systems deployed as Cloud datacenters are capable of provisioning service to consumers with diverse business requirements. Providers face pressure to provision uninterrupted, reliable services while reducing the operational costs incurred by significant software and hardware failures. A widely adopted means of achieving this goal is to use redundant system components to implement user-transparent failover, yet its effectiveness must be balanced carefully against the overhead it incurs when deployed, an important practical consideration for complex large-scale systems. Failover techniques developed for Cloud systems often suffer serious limitations, including mandatory restart, which leads to poor cost-effectiveness, and a sole focus on crash failures that omits other important types such as timing failures and simultaneous failures. This paper addresses these limitations by presenting a new approach to user-transparent failover for massive-scale systems. The approach uses soft-state inference to achieve rapid failure recovery and avoid unnecessary restarts, with minimal system resource overhead. It also copes with different failure types, including correlated and simultaneous events. The proposed approach was implemented, deployed, and evaluated within the Fuxi system, the underlying resource management system used within Alibaba Cloud. Results demonstrate that our approach tolerates complex failure scenarios while incurring at worst a 228.5 microsecond overhead per instance with 1.71 percent additional CPU usage.
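    The "recover from inferred soft state rather than restart" idea can be sketched briefly. In the following illustration (all names hypothetical), a standby rebuilds the primary's view from state piggybacked on heartbeats, so takeover resumes in-flight work instead of re-running it. This is a sketch of the general technique, not Fuxi's implementation.

```python
# Hypothetical sketch: a standby keeps soft state from heartbeats and
# takes over without a mandatory restart when the primary goes silent.
import time

class StandbyWorker:
    def __init__(self, timeout=3.0):
        self.soft_state = {}  # last reported task -> progress
        self.last_heartbeat = time.monotonic()
        self.timeout = timeout

    def on_heartbeat(self, state_snapshot):
        """Primary piggybacks its soft state on each heartbeat."""
        self.soft_state = dict(state_snapshot)
        self.last_heartbeat = time.monotonic()

    def primary_failed(self):
        return time.monotonic() - self.last_heartbeat > self.timeout

    def take_over(self):
        """Resume from inferred state; no restart of completed work."""
        return self.soft_state

standby = StandbyWorker(timeout=0.1)
standby.on_heartbeat({"task-42": 0.75})
time.sleep(0.2)  # simulate missed heartbeats
if standby.primary_failed():
    print("resuming with", standby.take_over())
```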

    An Analysis of the Server Characteristics and Resource Utilization in Google Cloud

    Understanding the resource utilization and server characteristics of large-scale systems is crucial if service providers are to optimize their operations whilst maintaining Quality of Service. For large-scale datacenters, identifying the characteristics of resource demand and the current availability of such resources allows system managers to design and deploy mechanisms that improve datacenter utilization and meet Service Level Agreements with their customers, as well as facilitating business expansion. In this paper, we present a large-scale analysis of server resource utilization and a characterization of a production Cloud datacenter, using the most recent datacenter trace logs made available by Google. We present their statistical properties and a comprehensive coarse-grained analysis of the data, including submission rates, server classification, and server resource utilization. Additionally, we perform a fine-grained analysis to quantify the server resource utilization wasted due to the early termination of tasks. Our results show that datacenter resource utilization remains relatively stable at between 40% and 60%, that the degree of correlation between server utilization and the Cloud workload environment varies by server architecture, and that the amount of resource utilization wasted varies between 4.53% and 14.22% across server architectures. This provides invaluable real-world empirical data for Cloud researchers in many subject areas.
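    The fine-grained "wasted utilization" measurement reduces to a simple aggregation over trace records, sketched below. The record layout and event names are loose assumptions modeled on the Google trace, not its exact schema.

```python
# Sketch: fraction of CPU-time consumed by tasks that terminated early
# (killed or failed), i.e. "wasted" utilization. Fields are assumed.
records = [
    # (server_id, cpu_seconds, final_event)
    ("s1", 120.0, "FINISH"), ("s1", 30.0, "KILL"),
    ("s2", 200.0, "FINISH"), ("s2", 15.0, "FAIL"),
]

used = sum(cpu for _, cpu, _ in records)
wasted = sum(cpu for _, cpu, ev in records if ev in ("KILL", "FAIL"))
print(f"wasted fraction: {wasted / used:.2%}")
```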

    An Approach for Characterizing Workloads in Google Cloud to Derive Realistic Resource Utilization Models

    Analyzing behavioral patterns of workloads is critical to understanding Cloud computing environments. However, until now only a limited number of real-world Cloud datacenter tracelogs have been available for analysis, leading to a lack of methodologies that capture the diversity of patterns existing in such datasets. This paper presents the first large-scale analysis of real-world Cloud data, using a recently released dataset featuring traces from over 12,000 servers over the period of a month. Based on this analysis, we develop a novel approach for characterizing workloads that, for the first time, considers Cloud workload in the context of both user and task in order to derive a model capturing resource estimation and utilization patterns. The derived model assists in understanding the relationship between users and tasks within workload, and enables further work such as resource optimization, energy-efficiency improvements, and failure correlation. Additionally, it provides a mechanism for creating patterns that fluctuate randomly based on realistic parameters. This is critical to emulating dynamic environments instead of statically replaying records in the tracelog. Our approach is evaluated by contrasting the logged data against simulation experiments, and our results show that the derived model parameters correctly describe the operational environment within a 5% error margin, confirming the great variability of patterns that exist in Cloud computing.
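    The key distinction the abstract draws, generating fluctuating synthetic patterns rather than statically replaying the log, can be illustrated with a small sketch. Fitting a normal distribution and clipping to [0, 1] are assumptions made here for illustration; the paper derives its own model.

```python
# Hedged sketch: fit simple parameters from logged task utilization,
# then sample fluctuating synthetic patterns instead of replaying the
# trace verbatim. Distribution choice is an illustrative assumption.
import random
from statistics import mean, stdev

logged_cpu = [0.21, 0.25, 0.19, 0.23, 0.27, 0.22]  # hypothetical trace values
mu, sigma = mean(logged_cpu), stdev(logged_cpu)

def synthetic_utilization(steps):
    """Emit a randomly fluctuating series from the fitted parameters."""
    return [min(1.0, max(0.0, random.gauss(mu, sigma))) for _ in range(steps)]

print(synthetic_utilization(5))
```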

    Using Byzantine Fault-Tolerance to Improve Dependability in Federated Cloud Computing

    Computing Clouds are typically characterized as large-scale systems that exhibit dynamic behavior due to variance in workload. However, how exactly these characteristics affect the dependability of Cloud systems remains unclear. Furthermore, provisioning reliable service within a Cloud federation, which involves the orchestration of multiple Clouds to provision service, remains an unsolved problem, especially when considering the threat of Byzantine faults. Recently, the feasibility of Byzantine Fault-Tolerance within single-Cloud and federated Cloud environments has been debated. This paper investigates Cloud reliability and the applicability of Byzantine Fault-Tolerance in Cloud computing, and introduces a Byzantine fault-tolerance framework that enables the deployment of applications across multiple Cloud administrations. An implementation of this framework has facilitated in-depth experiments comparing the reliability of Cloud applications hosted in a federated Cloud to that of a single Cloud.
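    For readers unfamiliar with Byzantine Fault-Tolerance, the core mechanism such a framework builds on can be sketched in a few lines: replicate a request across n = 3f + 1 independent Clouds and accept a result once f + 1 identical replies agree, masking up to f arbitrary responses. The voting rule shown is the textbook one, not necessarily the framework's exact protocol.

```python
# Textbook BFT majority vote across replies from independent Clouds.
from collections import Counter

def bft_vote(replies, f):
    """Return the result backed by at least f + 1 matching replies."""
    value, votes = Counter(replies).most_common(1)[0]
    if votes >= f + 1:
        return value
    raise RuntimeError("no result reached the f + 1 quorum")

# f = 1 tolerated fault, so 4 replicas; one Cloud returns a corrupt value.
print(bft_vote(["ok", "ok", "corrupt", "ok"], f=1))
```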

    An Analysis of Failure-Related Energy Waste in a Large-Scale Cloud Environment

    Cloud computing providers are under great pressure to reduce operational costs through improved energy utilization while provisioning dependable service to customers; it is therefore extremely important to understand and quantify the explicit impact of failures within a system in terms of energy costs. This paper presents the first comprehensive analysis of the impact of failures on energy consumption in a real-world large-scale Cloud system comprising over 12,500 servers, including a study of failure and energy trends across spatial and temporal environmental characteristics. Our results show that 88% of task failure events occur in lower-priority tasks, producing 13% of total energy waste, and that 1% of failure events occur in higher-priority tasks due to server failures, producing 8% of total energy waste. These results highlight an unintuitive but significant impact of failures on energy consumption, providing a strong foundation for research into dependable energy-aware Cloud computing.
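    Attributing energy waste to failure events by task priority amounts to a weighted aggregation, sketched below under the simplifying, assumed model that energy is CPU-time multiplied by a fixed server power draw; the fields and the 200 W figure are illustrative, not from the paper.

```python
# Sketch: energy waste per priority class, with energy approximated as
# cpu_seconds * power draw. All values here are assumptions.
POWER_WATTS = 200.0

events = [
    # (priority, cpu_seconds, outcome)
    ("low", 500.0, "FAIL"), ("low", 900.0, "FINISH"),
    ("high", 120.0, "FAIL"), ("high", 2000.0, "FINISH"),
]

waste = {}
for priority, cpu, outcome in events:
    if outcome == "FAIL":
        waste[priority] = waste.get(priority, 0.0) + cpu * POWER_WATTS

for priority, joules in waste.items():
    print(f"{priority}-priority energy waste: {joules / 3600:.1f} Wh")
```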

    Straggler Detection in Parallel Computing Systems through Dynamic Threshold Calculation

    Cloud computing systems face the substantial challenge of the Long Tail problem: a small subset of straggling tasks significantly impedes the completion of parallel jobs. This behavior results in longer service response times and degraded system utilization. Speculative execution, which creates task replicas at runtime, is a typical method deployed in large-scale distributed systems to tolerate stragglers. This approach defines stragglers via a static threshold value capturing the temporal difference between an individual task and the average task progression for a job. However, a static threshold debilitates speculation effectiveness, as it fails to consider the intrinsic diversity of job timing constraints within modern Cloud computing systems. Capturing such heterogeneity makes it possible to impose different levels of strictness for replica creation while achieving specified levels of QoS for different application types. Furthermore, a static threshold also fails to consider system environmental constraints in terms of replication overheads and optimal system resource usage. In this paper we present an algorithm for dynamically calculating a threshold value to identify task stragglers, considering key parameters including job QoS timing constraints, task execution characteristics, and optimal system resource utilization. We study and demonstrate the effectiveness of our algorithm by simulating a number of different operational scenarios based on real production cluster data and comparing against state-of-the-art solutions. Results demonstrate that our approach creates 58.62% fewer replicas under high resource utilization while reducing response time by up to 17.86% during idle periods, compared to a static threshold.
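    A dynamic threshold of the kind described might combine job QoS slack with current cluster utilization, as in the hedged sketch below: the cut-off tightens when a job has little deadline slack and loosens when the cluster is busy, limiting replica overhead. The particular weighting is an illustrative assumption, not the paper's algorithm.

```python
# Hedged sketch of a dynamic straggler threshold; weights are assumed.
def dynamic_threshold(base=1.5, qos_slack=0.5, utilization=0.5,
                      w_qos=0.4, w_util=0.4):
    """qos_slack, utilization in [0, 1]; returns a progress-time multiple."""
    return base - w_qos * (1.0 - qos_slack) + w_util * utilization

def stragglers(task_durations, threshold):
    """Flag tasks whose duration exceeds threshold x the job average."""
    avg = sum(task_durations.values()) / len(task_durations)
    return [t for t, d in task_durations.items() if d > threshold * avg]

tasks = {"t1": 100, "t2": 105, "t3": 180, "t4": 98}
# Tight deadline, idle cluster -> lower threshold, more eager speculation.
print(stragglers(tasks, dynamic_threshold(qos_slack=0.1, utilization=0.2)))
```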